ABSTRACT
Traditional parallel processing models, such as BSP, are "scale up" based, aiming to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems, while big data analytics tasks aiming for high throughput demand that large distributed systems "scale out" by continuously adding computing and storage resources through networks. Each one of the "scale up" model and "scale out" model has a different set of performance requirements and system bottlenecks. In this paper, we develop a general model that abstracts critical computation and communication behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT, represented by three matrices for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively. With the DOT model, any big data analytics job execution in various software frameworks can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault-tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation independent, and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
- http://hadoop.apache.org/.Google Scholar
- http://en.wikipedia.org/wiki/Recurrence_relation.Google Scholar
- http://en.wikipedia.org/wiki/K-means_clustering.Google Scholar
- http://www.tpc.org/tpch/.Google Scholar
- http://aws.amazon.com/ec2/.Google Scholar
- http://en.wikipedia.org/wiki/Parallel_Random_Access_Machine.Google Scholar
- A. Abouzied, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In VLDB, Lyon, France, 2009. Google ScholarDigital Library
- R. Avnur and J. M. Hellerstein. Eddies: Continuously Adaptive Query Processing. In SIGMOD, 2000. Google ScholarDigital Library
- S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In CIDR, 2003.Google Scholar
- D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, E. E. Santos, K. E. Schauser, R. Subramonian, and T. von Eicken. LogP: A Practical Model of Parallel Computation. Commun. ACM, 39(11):78--85, 1996. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. Google ScholarDigital Library
- J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
- J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On Distributing Symmetric Streaming Computations. In SODA, 2008. Google ScholarDigital Library
- J. Gray. Distributed Computing Economics. ACM Queue, 6(3):63--68, 2008. Google ScholarDigital Library
- Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, 2011. Google ScholarDigital Library
- B. Hindman, A. Konwinski, M. Zaharia, and I. Stoica. A Common Substrate for Cluster Computing. In HotCloud'09, 2009. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007. Google ScholarDigital Library
- H. J. Karloff, S. Suri, and S. Vassilvitskii. A Model of Computation for MapReduce. In SODA, 2010. Google ScholarDigital Library
- W. Kim. On Optimizing an SQL-like Nested Query. ACM Trans. Database Syst., 7(3):443--469, 1982. Google ScholarDigital Library
- R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet another SQL-to-MapReduce Translator. In ICDCS, 2011. Google ScholarDigital Library
- R. Lee, M. Zhou, and H. Liao. Request Window: an Approach to Improve Throughput of RDBMS-based Data Integration System by Utilizing Data Sharing Across Concurrent Distributed Queries. In VLDB, 2007. Google ScholarDigital Library
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, 2010. Google ScholarDigital Library
- D. G. Murray and S. Hand. Ciel: a universal execution engine for distributed data-flow computing. In NSDI '11, 2011. Google ScholarDigital Library
- L. Page, S. Brin, R. Motwani, and T. Winograd. The Pagerank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab.Google Scholar
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD Conference, 2009. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
- L. G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 33(8):103--111, 1990. Google ScholarDigital Library
- Y. Yu, P. K. Gunda, and M. Isard. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations. In SOSP, 2009. Google ScholarDigital Library
- X. Zhang, Y. Yan, and K. He. Latency Metric: An Experimental Method for Measuring and Evaluating Parallel Program and Architecture Scalability. J. Parallel Distrib. Comput., 22(3):392--410, 1994. Google ScholarDigital Library
Index Terms
- DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems
Recommendations
A Brief Survey on Big Data in Healthcare
This article presents a brief introduction to big data and big data analytics and also their roles in the healthcare system. A definite range of scientific researches about big data analytics in the healthcare system have been reviewed. The definition ...
Responsible Big Data Analytics for E-Business Services
ICBDR '21: Proceedings of the 5th International Conference on Big Data ResearchThis paper examines responsible big data analytics for e-business services and looks at how to use responsible big data analytics to obtain responsible e-business services. It addresses why responsibility matters to big data analytics and e-business ...
Beyond the hype
We define what is meant by big data.We review analytics techniques for text, audio, video, and social media data.We make the case for new statistical techniques for big data.We highlight the expected future developments in big data analytics. Size is ...
Comments